cuda.core: convert peer_accessible_by to a live MutableSet view by Andy-Jost · Pull Request #2018 · NVIDIA/cuda-python

Andy-Jost · 2026-05-04T21:31:28Z

Summary

DeviceMemoryResource.peer_accessible_by previously returned a sorted tuple[int, ...] backed by a Python-level cache. This is fragile in two ways: the cache can diverge from driver state across multiple wrappers around the same memory pool (#1720), and tuple semantics force callers into bespoke splice/replace patterns instead of standard set operations. This PR replaces the property with a live driver-backed collections.abc.MutableSet view, matching the proxy patterns established by AdjacencySetProxy (graphs) and the in-flight AccessedBySet (managed-memory advice, #1775).

This is a breaking change, captured in the v1.0.0 release notes.

Changes

New: `PeerAccessibleBySetProxy`

A MutableSet subclass living in cuda_core/cuda/core/_memory/_peer_access_utils.pyx.

Reads (__contains__, __iter__, __len__) call cuMemPoolGetAccess.
Writes (add, discard, and bulk ops) call cuMemPoolSetAccess.
Iteration yields Device objects.
add, discard, __contains__ accept Device | int.
The owner device is silently filtered (matching the existing planner contract).
All bulk operations (update, |=, &=, -=, ^=, clear) issue exactly one cuMemPoolSetAccess call. This matters: peer-access transitions can take seconds per pool because every existing memory mapping is updated, so coalescing into a single driver call lets the toolkit handle the mappings in parallel.

The proxy is constructed fresh on every property access. There is nothing to cache or pickle.

Cache removal

The _peer_accessible_by field on DeviceMemoryResource is dropped, along with its initializations in __cinit__, _DMR_init, and from_allocation_handle. The owned/non-owned read split is gone — every read path now queries the driver directly. This eliminates the bug class fixed in #1720 and simplifies the implementation.

Module consolidation

All peer-access internals now live in _peer_access_utils.pyx (promoted from a .py). The cdef inline driver helpers (_query_peer_access_ids, _peer_access_includes, _set_pool_access) and the cpdef replace_peer_accessible_by setter helper moved out of _device_memory_resource.pyx. DeviceMemoryResource now exposes only the public peer_accessible_by property: the getter returns PeerAccessibleBySetProxy(self) and the setter delegates to replace_peer_accessible_by(self, devices). This keeps the DMR surface focused on its memory-management role.

Property setter

mr.peer_accessible_by = [...] is preserved and unchanged in behavior. It still does a single batched cuMemPoolSetAccess via the same shared plan_peer_access_update path the proxy uses for bulk ops.

`_query_peer_access_ids` GIL optimization

The driver loop that enumerates peer device IDs now runs inside a single nogil block instead of acquiring/releasing the GIL once per device. Per-call data is collected into a libcpp.vector[int], and because range(total) ascends the result is already sorted (so the trailing sorted() is gone). This is a local performance tweak with no behavior change.

Test coverage

The existing integration tests (test_memory_peer_access.py, memory_ipc/test_peer_access.py) migrated from tuple equality to set-of-Device equality. Eight new tests pin the proxy's contract end-to-end, all using the existing mempool_device_x2 fixture so they run on CI:

MutableSet conformance. A new assert_single_member_mutable_set_interface(subject, member, non_member) helper exercises every MutableSet method against subjects whose backing store admits at most one insertable element (the peer-access proxy can hold at most {dev1} on a 2-GPU box). The original assert_mutable_set_interface(subject, items) is unchanged for normal multi-element subjects (AdjacencySetProxy).
Device/int interchangeability on add/discard/__contains__.
Owner-filtering contract on every write path (silent no-op).
Error paths: add(out_of_range) and add(non_coercible) raise; discard/__contains__ swallow the same inputs; remove(non_member) raises KeyError.
Live driver view: a proxy obtained before another wrapper modifies the pool reflects the change with no refresh step.
Iteration order is ascending by device_id; elements are Device instances; __repr__ shape; getter return type.
Single-call batching via a monkeypatch spy on _set_pool_access (the thin Python-visible helper extracted from _apply_peer_access_diff purely so tests can spy on the actual driver call). Every bulk op (|=/&=/-=/^=/update/difference_update/clear) and the property setter is asserted to issue exactly one cuMemPoolSetAccess, zero on no-op. This includes the dmr.peer_accessible_by |= {...} augmented-assignment-on-property pattern, where the empty-diff short-circuit prevents a second driver call from the trailing setter write.

Breaking change

DeviceMemoryResource.peer_accessible_by no longer returns a tuple[int, ...]. Callers must update:

Tuple comparisons (mr.peer_accessible_by == (0,)) -> set comparisons (mr.peer_accessible_by == {Device(0)}).
Iteration that expected ints -> iteration over Device objects (use [d.device_id for d in mr.peer_accessible_by] if int IDs are needed).

Setter usage (mr.peer_accessible_by = [...]) is unchanged.

Test plan

Existing test_memory_peer_access.py tests pass after migration to set semantics.
Existing test_memory_peer_access_utils.py tests still pass (plan_peer_access_update and normalize_peer_access_targets are unchanged).
Existing memory_ipc/test_peer_access.py tests pass (IPC-enabled pools, parent-child IPC scenarios).
Full cuda_core test suite passes on a single-GPU box (2931 passed, 212 skipped — peer-access tests gated on mempool_device_x2/x3).
All new peer-access tests pass on a 2-GPU box.
pre-commit run --all-files passes (ruff, format, SPDX, cython-lint, RST checks).

Related work

[BUG]: DeviceMemoryResource should query driver for peer access state on non-owned pools #1720 — root cause of the cache-divergence bug; this PR removes the cache entirely.
Add managed-memory advise, prefetch, and discard-prefetch free functions #1775 — Rob Parolin's managed-memory AccessedBySet, the second driver-backed set proxy in cuda.core.
cuda_core/cuda/core/graph/_graph_node.pyx (AdjacencySetProxy) — reference pattern for driver-backed set proxies. Same MutableSet shape, same _from_iterable plain-set fallback, same single-driver-call bulk-op contract.

DeviceMemoryResource.peer_accessible_by previously returned a sorted tuple[int, ...] backed by a Python-level cache, which was prone to divergence from driver state across multiple wrappers around the same memory pool. The setter accepted Device | int and emitted a single batched cuMemPoolSetAccess covering the diff against the cache. This commit replaces the property with a live driver-backed view: - Adds PeerAccessibleBySetProxy in _memory/_peer_access_utils.py, a collections.abc.MutableSet whose reads call cuMemPoolGetAccess and whose writes call cuMemPoolSetAccess. Iteration yields Device objects; add, discard, and __contains__ accept either a Device or a device-ordinal int. The proxy is constructed fresh on every property access, so there is nothing to cache or pickle. - Drops the _peer_accessible_by cache field (and its initializations in __cinit__, _DMR_init, and from_allocation_handle), eliminating the owned/non-owned read split. All pools now share the same code path and always query the driver. - All bulk operations on the proxy (update, |=, &=, -=, ^=, clear, pop) issue exactly one cuMemPoolSetAccess call. Peer-access transitions can take seconds per pool because every existing memory mapping is updated, so coalescing into a single driver call lets the toolkit handle the mappings in parallel. The property setter (mr.peer_accessible_by = [...]) preserves its original single-call behavior via the same shared planner path. - Single-element add validates can_access_peer through plan_peer_access_update, matching the existing setter contract. This is a breaking change captured in the v1.0.0 release notes. Callers comparing against tuples must update to set comparisons (mr.peer_accessible_by == {Device(0)}). Existing tests are migrated; new tests for set-interface conformance are intentionally deferred to a follow-up. Co-authored-by: Cursor <cursoragent@cursor.com>

copy-pr-bot · 2026-05-04T21:31:32Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

leofang · 2026-05-04T22:16:43Z

FYI: A breaking change must be labeled with P0.

…s.pyx The previous commit left DeviceMemoryResource carrying three pass-through def methods (_query_peer_access_ids, _peer_access_includes, _apply_peer_access_diff) whose only purpose was to give the pure-Python proxy in _peer_access_utils.py a way to call cdef helpers in _device_memory_resource.pyx. These methods served no public role and cluttered the class API. Promote _peer_access_utils.py to a Cython module so the proxy and the driver-touching helpers can live together: - Convert _peer_access_utils.py to _peer_access_utils.pyx. cimports cydriver and DeviceMemoryResource from the .pxd; uses nogil and direct CUmemAccessDesc packing identically to before. - Move _DMR_query_peer_access_ids, _DMR_peer_access_includes, _DMR_apply_peer_access_diff, and _DMR_replace_peer_accessible_by from _device_memory_resource.pyx into the new module as cdef helpers (and a cpdef replace_peer_accessible_by entry point used by the property setter). - Drop the three pass-through def methods from DeviceMemoryResource. The class is left with the property getter and setter only; everything else is module-level in _peer_access_utils. - The proxy now calls the module-level cdef helpers directly instead of routing through methods on mr. No behavior change. The public surface (PeerAccessibleBySetProxy, plan_peer_access_update, normalize_peer_access_targets, PeerAccessPlan) is preserved at the same import paths. Co-authored-by: Cursor <cursoragent@cursor.com>

Refactor _query_peer_access_ids so the entire driver loop runs inside a single nogil block instead of acquiring/releasing the GIL once per device. The flag query now uses a cached as_cu(mr._h_pool) handle and fills a libcpp.vector[int]; because range(total) ascends, the result is already sorted and the trailing sorted() call is dropped. Also tighten the peer_accessible_by entry in 1.0.0-notes.rst: the breaking-change blurb only needs to state the type/element change, so remove the implementation-flavored details about input acceptance and batched cuMemPoolSetAccess calls. Co-authored-by: Cursor <cursoragent@cursor.com>

…ge cases Existing peer-access tests covered the integration path well (real copies across peers, the full transition matrix, shared-pool consistency) but only touched ``in``, ``==``, and the property setter on the new set proxy. After the v1.0.0 break that surfaced ~25 ``MutableSet`` methods, nothing was pinning the type-coercion contract, the owner-filtering behavior, the ``KeyError``/value-error paths, or the "one ``cuMemPoolSetAccess`` per bulk op" performance invariant. Add the following coverage in ``test_memory_peer_access.py``: - A ``MutableSet`` conformance test using a relaxed ``assert_mutable_set_interface`` mode that admits subjects holding at most one insertable element. CI maxes at two GPUs (one peer), so the multi-element protocol pass cannot run there. The new ``support_multi_insert=False`` path takes one insertable item plus two non-member sentinels and exercises every ``MutableSet`` method (``add``/``discard``/``remove``/``pop``/``clear``/``update``, comparisons, isdisjoint, subset/superset, binary and in-place operators, ``__iter__``/``__len__``/``__repr__``). - ``Device``/``int`` interchangeability on ``add``/``discard``/``__contains__``. - The owner-device filtering contract on every write (silent no-op). - Error paths: ``add(out_of_range)`` and ``add(non_coercible)`` raise while the lenient ``discard``/``__contains__`` paths swallow the same inputs; ``remove(non_member)`` raises ``KeyError``. - "Live driver view" semantics: a proxy obtained before another wrapper modifies the pool reflects the change with no refresh step. - ``__iter__`` ordering is ascending by ``device_id`` and elements are ``Device`` instances; ``__repr__`` includes the class name and tracks live contents; the getter returns the documented proxy type. - A batching spy that monkeypatches the module-level ``_apply_peer_access_diff`` and asserts that every bulk op (``|=``/``&=``/``-=``/``^=``/``update``/``difference_update``/ ``clear``) and the property setter issues at most one driver call, zero when the diff is empty. To make the spy possible, ``_apply_peer_access_diff`` is now a Python-visible ``def`` wrapper around a renamed ``_apply_peer_access_diff_cython`` ``cdef inline``. The proxy and the property setter still call ``_apply_peer_access_diff`` by bare name, which Cython resolves through the module's globals at runtime, so a ``monkeypatch.setattr(_peer_access_utils, "_apply_peer_access_diff", ...)`` intercepts them. The extra Python-level dispatch is negligible next to ``cuMemPoolSetAccess`` itself. Co-authored-by: Cursor <cursoragent@cursor.com>

Augmented assignment on the ``peer_accessible_by`` property (``dmr.peer_accessible_by |= {...}``) is two trips through the proxy/setter pair, not one: Python fetches the proxy, the proxy mutates itself in place via ``__ior__``, and Python then assigns the (already-mutated) proxy back through the setter. That trailing setter call computes the diff against current driver state, finds it empty, and short-circuits inside the ``cdef inline`` before issuing any ``cuMemPoolSetAccess`` work — so the *driver-level* contract ("one batched call per bulk op") still holds, but the wrapper is invoked twice, which the spy was counting. Also, the fixture's ``dmr.peer_accessible_by = []`` reset on an already empty pool is itself an empty-delta wrapper call. Filter the recorded calls down to those with non-empty deltas (the ones that translate to real driver work) and switch the bulk-ops test to use a locally bound proxy so augmented assignment goes through ``__ior__`` once with no extra setter invocation. The setter test stays on ``dmr.peer_accessible_by = ...`` because that is the public API contract under test there. Co-authored-by: Cursor <cursoragent@cursor.com>

…hing Move the actual ``cuMemPoolSetAccess`` invocation (descriptor-array build + driver call) into a thin Python-visible ``def _set_pool_access`` in ``_peer_access_utils.pyx``. ``_apply_peer_access_diff`` now does only the empty-diff short-circuit and delegates the work to ``_set_pool_access``, which Cython resolves through the module globals at runtime so tests can intercept it via ``monkeypatch.setattr``. Replace the previous internal-wrapper spy with a driver-call spy that counts every real ``cuMemPoolSetAccess`` invocation. Earlier no-op layers (e.g. the augmented-assignment-on-property pattern that writes an already-mutated proxy back through the setter) short-circuit before reaching ``_set_pool_access``, so the recorded count is exactly the number of driver calls. The empty-delta filter and the local-binding workaround in the bulk-ops test are gone; we now also assert that ``dmr.peer_accessible_by |= {...}`` directly on the property is still exactly one driver call. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace the ``support_multi_insert`` flag and ``non_members`` keyword with two purpose-built helpers: - ``assert_mutable_set_interface(subject, items)`` keeps the original signature and contract: at least five distinct insertable items, exercised against a reference set in the standard way. The graph ``AdjacencySetProxy`` test continues to use this unchanged. - ``assert_single_member_mutable_set_interface(subject, member, non_member)`` is a focused pass for proxies whose backing store admits at most one insertable element at a time (here, the peer-access view on a 2-GPU box). It threads a single member and one non-member sentinel through every ``MutableSet`` method. The two helpers share small private utilities (empty-state checks, ``__repr__`` shape) but keep their public surfaces small and linear. A capacity-one proxy is a meaningfully different contract from a general mutable set; naming that explicitly in the API reads better than a flag and avoids forcing call sites to plumb sentinels through. Co-authored-by: Cursor <cursoragent@cursor.com>

Andy-Jost · 2026-05-05T00:26:33Z

/ok to test

Andy-Jost · 2026-05-05T00:36:16Z

+def assert_single_member_mutable_set_interface(subject, member, non_member):
+    """Exercise every MutableSet method on a subject with capacity one.
+
+    Use this for proxies whose backing store admits at most one insertable
+    element at a time (typically because the underlying resource is bounded
+    by hardware, e.g. a peer-access view on a system with a single valid
+    peer device). The subject only ever holds ``set()`` or ``{member}``;
+    *non_member* supplies the right-hand side of comparisons, ``isdisjoint``,
+    subset/superset, and binary/in-place operators so every ``MutableSet``
+    method is exercised at least once.


@rparolin you might be able to use this to test AccessedBySetProxy in #1775.

github-actions · 2026-05-05T00:45:59Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-2018/
https://nvidia.github.io/cuda-python/pr-preview/pr-2018/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-2018/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-2018/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

leofang · 2026-05-05T02:25:48Z

Q: It is a big diff with a very niche use case. Are we sure this is P0?

Andy-Jost · 2026-05-05T14:24:56Z

Q: It is a big diff with a very niche use case. Are we sure this is P0?

The peer access list is currently a tuple. This change aligns with the set proxies in the graph module (AdjacencySetProxy) and the managed memory hint functions (AccessedBySetProxy). Without this, users would have to remember which interfaces uses a proxy and which do not. I think as a matter of course we should favor container proxies; they are more Pythonic.

This change is not functionally as big as it might appear. (1) Most of the bulk is just code movement: I moved the peer access implementation from _device_memory_resource.pyx to _peer_access_utils.pyx just to keep the DMR implementation clean. (2) I had to change _peer_access_utils from a .py to a .pyx file to accommodate the moved code, and git didn't identify it as a move/rename.

…-by-set-proxy

…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/docs/source/api_private.rst

…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/docs/source/api_private.rst # cuda_core/docs/source/release/1.0.0-notes.rst

…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/cuda/core/_memory/_device_memory_resource.pyx

Andy-Jost added this to the cuda.core v1.0.0 milestone May 4, 2026

Andy-Jost added P1 Medium priority - Should do feature New feature or request cuda.core Everything related to the cuda.core module breaking Breaking changes are introduced labels May 4, 2026

Andy-Jost self-assigned this May 4, 2026

Andy-Jost added P0 High priority - Must do! and removed P1 Medium priority - Should do labels May 4, 2026

Andy-Jost and others added 5 commits May 4, 2026 16:30

Andy-Jost marked this pull request as ready for review May 5, 2026 00:33

Andy-Jost requested a review from rparolin May 5, 2026 00:34

Andy-Jost commented May 5, 2026

View reviewed changes

Andy-Jost added 4 commits May 5, 2026 11:32

Merge remote-tracking branch 'origin/main' into ajost/peer-accessible…

b5757f7

…-by-set-proxy

Merge remote-tracking branch 'origin/main' into ajost/peer-accessible…

3a6af4f

…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/docs/source/api_private.rst

Merge remote-tracking branch 'origin/main' into ajost/peer-accessible…

e90f0b9

…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/docs/source/api_private.rst # cuda_core/docs/source/release/1.0.0-notes.rst

Merge remote-tracking branch 'origin/main' into ajost/peer-accessible…

e95b505

…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/cuda/core/_memory/_device_memory_resource.pyx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuda.core: convert peer_accessible_by to a live MutableSet view#2018

cuda.core: convert peer_accessible_by to a live MutableSet view#2018
Andy-Jost wants to merge 11 commits intoNVIDIA:mainfrom
Andy-Jost:ajost/peer-accessible-by-set-proxy

Andy-Jost commented May 4, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 4, 2026

Uh oh!

leofang commented May 4, 2026

Uh oh!

Andy-Jost commented May 5, 2026

Uh oh!

Andy-Jost May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

leofang commented May 5, 2026

Uh oh!

Andy-Jost commented May 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Andy-Jost commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

New: PeerAccessibleBySetProxy

Cache removal

Module consolidation

Property setter

_query_peer_access_ids GIL optimization

Test coverage

Breaking change

Test plan

Related work

Uh oh!

copy-pr-bot Bot commented May 4, 2026

Uh oh!

leofang commented May 4, 2026

Uh oh!

Andy-Jost commented May 5, 2026

Uh oh!

Andy-Jost May 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 5, 2026

Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

leofang commented May 5, 2026

Uh oh!

Andy-Jost commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Andy-Jost commented May 4, 2026 •

edited

Loading

New: `PeerAccessibleBySetProxy`

`_query_peer_access_ids` GIL optimization

Andy-Jost commented May 5, 2026 •

edited

Loading